Evaluation of Three-Dimensional Scenes Using Two-Dimensional Representations

ABSTRACT

A system adapted to implement a learning rule in a three-dimensional (3D) environment is described. The system includes: a renderer adapted to generate a two-dimensional (2D) image based at least partly on a 3D scene; a computational element adapted to generate a set of appearance features based at least partly on the 2D image; and an attribute classifier adapted to generate at least one set of learned features based at least partly on the set of appearance features and to generate a set of estimated scene features based at least partly on the set of learned features. A method labels each image from among the set of 2D images with scene information regarding the 3D scene; selects a set of learning modifiers based at least partly on the labeling of at least two images; and updates a set of weights based at least partly on the set of learning modifiers.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 13/736,060, filed on Jan. 7, 2013. U.S. patent application Ser. No. 13/736,060 claims priority to U.S. Provisional Patent Application Ser. No. 61/583,193, filed on Jan. 5, 2012.

BACKGROUND

Many potential applications (e.g., robotics, gaming environments, etc.) may wish to utilize automated visual capture and/or analysis in order to evaluate virtual and/or physical three-dimensional (3D) environments in various ways. Such applications may be limited by sensing equipment (e.g., a robot may have only a two-dimensional (2D) camera available), processing power, and/or other factors.

Existing algorithms for automated visual evaluation do not make use of combined information in 3D scenes and images at the same time. Some existing solutions use multiple cameras to construct a three-dimensional representation of a scene in order to measure 3D features by virtue of multiple images. Other existing solutions use 2D images and associated 3D measurements (e.g., of a face) in order to create a model of a 3D feature (e.g., the face). Some existing systems utilize surfaces of an object for identification (e.g., facial recognition). Some existing algorithms estimate a shape from some other feature (e.g., motion or shading). In addition, some existing algorithms provide hierarchical feature selection. Some existing algorithms also utilize temporal slowness of features in an attempt to learn higher order visual features without labeled data.

As can be seen, there is a need for a general purpose way to evaluate sets of visual features by exploiting the relationship between images and scenes which can be applied to a variety of visual evaluation tasks.

BRIEF SUMMARY

The present invention relates to the field of computer vision. Particularly, the invention relates to a system that is able to select features of an image and use a combination of such invariant features to perform one or more desired visual tasks.

A hybrid method of some embodiments combines bottom-up unsupervised learning of visual features with supervised learning that employs an error function to evaluate the quality of mid-level representations of scene features.

The system of some embodiments learns the relationships among scene features, other scene features and image features using labeled examples. By using computer-rendered scenes, it is possible to, for example, isolate the moments where features disappear or appear, thus having algorithms that learn more precisely than by assuming that features persist (which is not always true).

Some embodiments infer the contents and structure of a visual scene from a two-dimensional image. The forward problem of computer graphics can be solved mathematically, but the inverse problem of visual inference is ill-posed; there is no single solution. However, with the right set of assumptions, a problem can become tractable. For example, a machine vision problem becomes much easier in a controlled setting where the lighting is bright and homogeneous, and all objects are at a fixed distance with a canonical view. Non-linear transforms of pixel intensity may be sought in order to obtain features that are invariant to changes in illumination or viewpoint. Many such invariant features may be constructed through a set of design rules, and then validated and optimized on a particular set of classification tasks.

Biological visual systems must be able to support many different tasks. As a result, optimizing front-end features for a single task might impoverish other visual tasks. The system of some embodiments learns from one task and improves performance on other tasks based on the learning. Such improvement occurs if the learning improves the mapping from appearances to true relevant features of a scene. The learning may be general, and the features invariant to the task at hand. Object recognition may be referred to as an example of a visual task, but the reasoning, algorithms and systems of some embodiments may apply to other visual tasks as well.

Some embodiments predict object identity using scene features, and nothing else. Full knowledge of the scene features would thus determine the ceiling for performance on the task. Some embodiments are able to learn a set of features that are optimized to perform estimation of scene features (and thus likely to be invariant across tasks), and to use this same basis as inputs for one or more visual tasks.

One exemplary embodiment of the invention provides a system adapted to implement a learning rule in a three-dimensional (3D) environment. The system includes: a renderer adapted to generate a two-dimensional (2D) image based at least partly on a 3D scene; a computational element adapted to generate a set of appearance features based at least partly on the 2D image; and an attribute classifier adapted to generate at least one set of learned features based at least partly on the set of appearance features and to generate a set of estimated scene features based at least partly on the set of learned features.

Another exemplary embodiment of the invention provides an automated method adapted to provide learning about a three-dimensional (3D) scene using a set of two-dimensional (2D) images. The method includes: labeling each image from among the set of 2D images with scene information regarding the 3D scene; selecting a set of learning modifiers based at least partly on the labeling of at least two images; and updating a set of weights based at least partly on the set of learning modifiers.

Yet another exemplary embodiment of the invention provides a computer readable medium storing an image evaluation application adapted to enable learning about a three-dimensional (3D) scene using a set of two-dimensional (2D) images. The application includes sets of instructions for: labeling each image from among the set of 2D images with scene information regarding the 3D scene; selecting a set of learning modifiers based at least partly on the labeling of at least two images; and updating a set of weights based at least partly on the set of learning modifiers.

The preceding Summary is intended to serve as a brief introduction to some embodiments of the invention. It is not meant to be an introduction or overview of all inventive subject matter disclosed in this document. The Detailed Description that follows and the Drawings (or “Figures” or “FIGS.”) that are referred to in the Detailed Description will further describe the embodiments described in the Summary as well as other embodiments. Accordingly, to understand all the embodiments described by this document, a full review of the Summary, Detailed Description and the Drawings is needed. Moreover, the claimed subject matter is not to be limited by the illustrative details in the Summary, Detailed Description and the Drawings, but rather is to be defined by the appended claims, because the claimed subject matter may be embodied in other specific forms without departing from the spirit of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth in the appended claims. However, for purposes of explanation, several embodiments of the invention are set forth in the following drawings.

FIG. 1 illustrates a conceptual schematic block diagram of an information system according to an exemplary embodiment of the invention;

FIG. 2 illustrates a flow chart of a conceptual process used by some embodiments to implement a hybrid method that uses supervised and unsupervised learning;

FIG. 3 illustrates a conceptual schematic block diagram of an information system that may use multiple feature types to estimate higher order features, according to an exemplary embodiment of the invention;

FIG. 4 illustrates a schematic block diagram of an information system that may use one or more sets of true scene features to optimize higher order estimated scene features;

FIG. 5 illustrates a schematic block diagram of an information system that may include one or more sets of higher-order estimated scene features;

FIG. 6 illustrates a flow chart of a conceptual process used by some embodiments to estimate a variety of scene features;

FIG. 7 illustrates a schematic block diagram of a conceptual system used to implement some embodiments of the invention;

FIG. 8 illustrates a schematic block diagram of an alternative conceptual system used to implement some embodiments of the invention;

FIG. 9 illustrates a side view of an object with a first set of visual properties and another object with a second set of visual properties;

FIG. 10 illustrates a flow chart of a conceptual process used by some embodiments to provide object invariant representations of objects;

FIG. 11 illustrates a flow chart of a conceptual process used by some embodiments to evaluate multiple variables;

FIG. 12 illustrates a side view and a top view of an example object layout within a scene, an x-y plot of a cross section of a depth image, an x-y plot of a cross section of an estimate of the probability of an occluding edge, and a 3D plot of a timeline;

FIG. 13 illustrates a flow chart of a conceptual process used by some embodiments to append dense labels to an image;

FIG. 14 illustrates a flow chart of a conceptual process used by some embodiments to estimate the joint probability of features and transforms;

FIG. 15 illustrates a sequence of images used by some embodiments to estimate occluding edges of a transforming object and an x-y plot of occlusion error over time;

FIG. 16 illustrates a flow chart of a conceptual process used by some embodiments to predict image properties using sequences of images;

FIG. 17 illustrates a conceptual process used by some embodiments to group features;

FIG. 18 illustrates a flow chart of a conceptual process used by some embodiments to predict and apply future transformations;

FIG. 19 illustrates training processes used by some embodiments and two example configurations for combining supervised and unsupervised learning;

FIG. 20 illustrates a flow chart of a conceptual process used by some embodiments to train both supervised and unsupervised levels in a system of some embodiments; and

FIG. 21 conceptually illustrates a schematic block diagram of a computer system with which some embodiments of the invention may be implemented.

DETAILED DESCRIPTION

In the following detailed description of the invention, numerous details, examples, and embodiments of the invention are set forth and described. However, it will be clear and apparent to one skilled in the art that the invention is not limited to the embodiments set forth and that the invention may be practiced without some of the specific details and examples discussed.

Several more detailed embodiments of the invention are described in the sections below. Section I provides a conceptual overview of the scheme implemented by some embodiments. Section II then describes conceptual systems used by some embodiments to evaluate image data. Next, Section III describes various methods of operations provided by some embodiments and provides various example implementations. Section IV then describes cost-based feature analysis used by some embodiments. Lastly, Section V describes a computer system which implements some of the embodiments of the invention.

I. Overview

Sub-section I.A provides a conceptual description of the flow of information used by some embodiments. Sub-section I.B then describes estimation of higher order features using lower level learned features. Lastly, sub-section I.C describes a learning algorithm used by some embodiments.

Some embodiments provide a way to use various inputs that may be used in subsequent labeled learning. Such inputs may include, for example, linear transforms of an image, biologically inspired transforms mimicking the front end of a mammalian visual system (including but not limited to the retina, visual thalamus, and primary visual cortex), and normalization procedures such as luminance normalization, contrast normalization and other features that may be divisively normalized. A processing node within the network may represent its activation as an analog value, a binary activation state, a probability, a belief distribution, a discrete state on N possibilities, a point process over time, or any representation appropriate to the supervised learning algorithm employed.

A standard framework for generating pairs of images and associated scene features to train estimators of the scene features may be provided by some embodiments. The framework may employ pixel-aligned feature maps that can easily compare the visual support in the image to the ground truth of the predicted features. The maps may include images of logical values evaluated at every spatial location. Learning samples may be chosen according to a criterion, such as matching the number of positive and negative exemplars, maintaining the relative frequency of features, etc. For this reason, not every “pixel” in an image may be used to update the learning rule. Often many of the pixels may be used to drive the context that activates the system, so that features of the context will be learned if the features help to estimate the current scene feature being trained.
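
As an illustration, the following minimal sketch (Python with NumPy; all function and variable names are hypothetical, not part of the disclosure) draws matched numbers of positive and negative exemplar locations from a pixel-aligned map of logical values:

    import numpy as np

    def balanced_samples(label_map, n_per_class, rng=np.random.default_rng(0)):
        """Draw equal numbers of positive and negative pixel locations
        from a pixel-aligned boolean feature map."""
        pos = np.argwhere(label_map)    # (row, col) of positive exemplars
        neg = np.argwhere(~label_map)   # (row, col) of negative exemplars
        n = min(n_per_class, len(pos), len(neg))
        pos_pick = pos[rng.choice(len(pos), size=n, replace=False)]
        neg_pick = neg[rng.choice(len(neg), size=n, replace=False)]
        return pos_pick, neg_pick

    # Example: a map where the feature is present in one image region.
    label_map = np.zeros((64, 64), dtype=bool)
    label_map[20:30, 20:30] = True
    pos_pick, neg_pick = balanced_samples(label_map, n_per_class=50)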

A target feature sensitivity and spatial invariance may be defined by a designer using the training signal for a particular scene feature. Analog values may be encoded as a range (such as between thirty and thirty-two degrees), defined by sensitivity per feature and the tiling density of the parameter. The feature values for the learning rule may be binary, but may be deterministic or stochastic. In the latter case, a range of values may be encoded by a kernel (such as a Gaussian with a peak amplitude of one, or a boxcar with cosine-rounded edges, etc.). The spatial invariance may be encoded by, for example, a rule that takes one of the following forms: “if at least one pixel within radius R has property X” or “if at least fraction F of the pixels within radius R have property X.”
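
A minimal sketch of these two encodings, assuming Python with NumPy and hypothetical names, is shown below; the Gaussian kernel encodes an analog value with a peak amplitude of one, and the second function implements the “fraction F within radius R” rule:

    import numpy as np

    def gaussian_encoding(value, centers, sigma):
        """Encode an analog value (e.g., an orientation in degrees) as
        kernel activations with a peak amplitude of one."""
        return np.exp(-0.5 * ((value - centers) / sigma) ** 2)

    def fraction_within_radius(label_map, row, col, radius, fraction):
        """Spatial invariance rule: true if at least `fraction` of the
        pixels within `radius` of (row, col) have the property."""
        rows, cols = np.ogrid[:label_map.shape[0], :label_map.shape[1]]
        disk = (rows - row) ** 2 + (cols - col) ** 2 <= radius ** 2
        return label_map[disk].mean() >= fraction

    centers = np.arange(0.0, 180.0, 2.0)                 # tiling density: 2 degrees
    code = gaussian_encoding(31.0, centers, sigma=1.0)   # ~30-32 degree range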

Some embodiments may provide a method for pruning the number of features in a layer of learned features. A large number of parameters may become computationally intractable if there are many scene features being estimated, large images, or large regions of neighborhood connectivity from input layers. Specifically, the higher order estimates have many inputs. Thus, the training of the input may be done on one feature at a time. Then, a greedy process may be used to add one scene feature map at a time, until asymptotic behavior is achieved. Next, a removal process may be used whereby individual features are removed, one at a time, with the least important feature at each step removed, until performance on the higher order classification reaches asymptotic behavior. A few rounds of alternating addition and subtraction of features may be included to confirm convergence, and to estimate the error of the greedy feature selection process. The feature removal is similar to a sparseness constraint (which sets some feature weights to zero), and other sparseness methods may be used to achieve similar results. Features may be selected (and connections to the higher order estimators maintained) by virtue of the contribution of the features to performance on the scene estimation task, rather than an unknown future task.
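
The following sketch illustrates one possible form of this alternating greedy process (plain Python; the `score` function, which would evaluate higher order classification performance for a candidate feature set, is a hypothetical stand-in):

    def greedy_select(all_features, score, tol=1e-3):
        """Greedily add the best feature map until performance plateaus,
        then remove the least important features one at a time."""
        selected = []
        best = float("-inf")
        # Addition phase: add one scene feature map at a time.
        while True:
            candidates = [f for f in all_features if f not in selected]
            if not candidates:
                break
            f_best = max(candidates, key=lambda f: score(selected + [f]))
            new = score(selected + [f_best])
            if new - best <= tol:          # asymptotic behavior reached
                break
            selected.append(f_best)
            best = new
        # Removal phase: drop the least important feature while performance holds.
        while len(selected) > 1:
            f_drop = max(selected,
                         key=lambda f: score([g for g in selected if g != f]))
            new = score([g for g in selected if g != f_drop])
            if best - new > tol:           # removal now hurts; stop
                break
            selected.remove(f_drop)
            best = new
        return selected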

In some embodiments, the strength of a weight change in an update rule may be scaled by a coefficient that depends on the temporal change of a feature. Features that persist over time may have a greater or lesser weight. Greater weights at temporal boundaries emphasize learning the differences among features that may often occur close in time, but have different meaning. Additionally, the strength of a change may be scaled by a function that reflects the relative frequency of the feature according to an empirical or generative model of scenes.

A system of some embodiments may employ a network topology that includes a hidden layer between every scene estimate and an appearance layer, a higher-order scene estimate that includes inputs of many different types of first order scene estimates, and/or a spatial focus defined by a neighborhood connectivity rule across maps. Each estimated feature may have a pyramid representation over spatial scale, allowing appropriate compression for larger spatial frequencies.

In some embodiments, a neighborhood connectivity rule may define a spatial basis for sample features of another type. For example, imagine estimating whether a set of features is an eye. Based on domain expertise, a vertically oriented occluding edge (i.e., the edge of a head) may be informative for improving the estimate of the presence of an eye.

A spatial basis may tile a local region at multiple scales. Thus a low spatial frequency occluding edge would activate a laterally displaced kernel in the basis. The basis may be defined by polar coordinates, with larger regions included at larger eccentricities. In proximal regions, a spatial basis may include all local samples (local connectivity is “all-to-all”). At larger radial distances, a single feature in the spatial basis may be the weighted average feature activity in that region. Such regions may include uniform non-overlapping sections; they may also be weighted kernels that could overlap.

A spatial neighborhood basis may be constructed mathematically or empirically. Mathematically, such a spatial neighborhood basis may be created to tile space like a dartboard by taking the product of a smoothed angular region and a smoothed radial region. Empirically, such a spatial neighborhood basis may be measured directly (or a smoothed approximation or parametric representation of the basis may be generated). Such bases may be generated by exposing the system to natural images and saving the complete history of a reference map organized into feature-present and feature-absent for a learned feature, performing an eigenvalue decomposition on the difference in the covariance matrix of the reference map between present and absent (these bases may be referred to as conditional eigen images, as they provide a basis for discriminating whether a feature is present), and keeping only the most significant eigenvectors, while removing the rest. It may be computationally intensive to perform this analysis for each feature; thus, if the bases are similar across features, they may be treated as canonical for other feature types.
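
A minimal sketch of the mathematical construction, assuming NumPy and hypothetical parameter choices (a von Mises-like smoothed angular window and Gaussian radial windows), is shown below:

    import numpy as np

    def dartboard_basis(size, n_angles=8, n_rings=3, sharpness=4.0):
        """Tile space like a dartboard: each basis kernel is the product
        of a smoothed angular region and a smoothed radial region."""
        ys, xs = np.mgrid[:size, :size] - (size - 1) / 2.0
        theta = np.arctan2(ys, xs)              # angle at each pixel
        r = np.hypot(ys, xs) / (size / 2.0)     # radius, normalized to ~1
        kernels = []
        ring_centers = np.linspace(0.2, 1.0, n_rings)
        for a in range(n_angles):
            ang_c = -np.pi + (a + 0.5) * 2.0 * np.pi / n_angles
            # Smoothed angular window centered on ang_c.
            ang = np.exp(sharpness * (np.cos(theta - ang_c) - 1.0))
            for rc in ring_centers:
                # Smoothed radial window centered on ring rc.
                rad = np.exp(-0.5 * ((r - rc) / (0.5 / n_rings)) ** 2)
                kernels.append(ang * rad)       # product of the two regions
        return np.stack(kernels)                # (n_angles * n_rings, size, size)

    basis = dartboard_basis(33)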

Some embodiments utilize an iterative training process whereby the system is gradually built up. First, appearance features may be generated either by design convention (e.g., a collection of parameterized Gabor wavelets, Laplace transforms of Gaussian functions, scale-invariant feature transform (SIFT), etc.) or by learning from a database of images. This stage does not require scene information. Second, scene estimators are trained by providing appearance features from locations in a scene that are appropriate for learning each scene feature. Third, higher order scene estimates are trained that have access to the ground truth of each of the “other” scene estimates. Fourth, training continues, but the specificity of the ground truth is corrupted by noise that is proportional to the error magnitude of the estimators. Fifth, training continues, but scene ground truth is replaced by scene estimates. As training continues, more and more scene estimates are used, until the algorithm has no more dependence on the scene information, and generates higher order scene estimates using nothing but the non-linear hierarchy that transforms appearance. Finally, a particular visual task is performed which has access to all of the learned features. Fine tuning may occur.
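
One possible reading of the fourth and fifth steps is a curriculum that blends ground truth with estimates. The following sketch, assuming NumPy and hypothetical names, corrupts the ground truth with error-proportional noise and ramps a mixing coefficient toward pure estimates:

    import numpy as np

    def higher_order_input(truth, estimate, alpha, noise_scale=0.0, rng=None):
        """alpha ramps from 0 (pure ground truth) to 1 (pure estimates).
        noise_scale > 0 corrupts the ground truth in proportion to the
        current error magnitude of the estimator."""
        if rng is None:
            rng = np.random.default_rng(0)
        err = np.abs(truth - estimate)   # per-feature error magnitude
        noisy_truth = truth + rng.normal(0.0, 1.0, truth.shape) * noise_scale * err
        return (1.0 - alpha) * noisy_truth + alpha * estimate

    # Ramp alpha over training until there is no dependence on scene info.
    for alpha in np.linspace(0.0, 1.0, 5):
        mixed = higher_order_input(np.ones(4), np.full(4, 0.8), alpha,
                                   noise_scale=0.5)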

Thus, the final algorithm may be able to operate on images alone. The pairing with scenes allows the system to learn the mapping to scene estimates, or to discover a rich basis that is capable of extracting such information for subsequent computations.

A. Information Flow

FIG. 1 illustrates a conceptual schematic block diagram of an information system 100 according to an exemplary embodiment of the invention. Specifically, this figure illustrates various elements and data pathways that may be used to evaluate a 3D scene. As shown, the system 100 may include a 3D scene 105, a 2D image 110, a set of appearance features 115, a set of learned features 120, a set of estimated scene features 125, a set of true labels 130 associated with a spatial map of logicals 135, a learning rule 140, a parameter update at the first level 145, and a parameter update at subsequent levels 150.

The 3D scene 105 may include data related to a 3D scene. Such data may be utilized in various appropriate formats. The scene may be related to a virtual 3D environment (e.g., a gaming environment, a 3D modeled environment, a real-world environment, etc.).

The 3D scene may be rendered to provide at least one 2D image 110. Such a 2D image may include data related to the 3D scene that is presented in an appropriate format for a 2D image. A 2D image is meant to represent two physical dimensions, but may include multiple other dimensions of data (e.g., an image may include color data, and/or other such data).

The 2D image may be used to calculate a set of appearance features 115 associated with the 2D image 110 (and thus the 3D scene 105). The appearance features 115 may include various appropriate types of features (e.g., edges, wavelets, gradients, etc.). The generation of the appearance features 115 from the 3D scene 105 may be considered pre-processing that formats the data in a way that is appropriate for further evaluation.

The set of learned features 120 may be generated based on the appearance features 115 and the output “v” of the parameter update at the first level 145 (i.e., the current state of the parameter “v” based on a previous update, default condition, etc.). The set of learned features 120 may be generated at least partly based on equation (1) below, where equation (1) is one example of a forward transformation that may be used by some embodiments.

y_j = g(Σ_i (x_i * v_ij))  (1)

In this example, “y” may be a one-dimensional vector that is calculated based on a non-linear function “g” that operates on a sum of the cross products of a one-dimensional vector “x” representing the appearance features 115 and a column j of a two-dimensional matrix “v” representing the weights at the first level update 145.

The learned features may include collections of co-occurring appearance features. In one embodiment, these may be collections of wavelet bases that may be used to predict the gradient of the surface normal at each location in the image. In another embodiment, the learned features may be collections of other appearance features that can predict oriented occluding edges, albedo, 3D motion, surface texture, or other scene attributes, etc.

The estimated scene features 125 may be generated based on the learned features 120 and the output “w” of the parameter update at subsequent levels 150 (i.e., the current state of the parameter “w” based on a previous update, default condition, etc.). The set of estimated scene features 125 may be generated at least partly based on equation (2) below, where equation (2) is one example of a forward transformation that may be used by some embodiments.

z_j = g(Σ_i (y_i * w_ij))  (2)

In this example, “z” may be a one-dimensional vector that is calculated based on a non-linear function “g” that operates on a sum of the cross products of the vector “y”, calculated above, and a column j of a two-dimensional matrix “w” representing weights from the parameter update at subsequent levels 150.
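
A compact NumPy sketch of equations (1) and (2) follows; the logistic squashing function stands in for the unspecified non-linearity “g”, and all dimensions are arbitrary example values:

    import numpy as np

    def g(a):
        # Logistic squashing function; stands in for the unspecified "g".
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    x = rng.standard_normal(50)          # appearance features (vector "x")
    v = rng.standard_normal((50, 20))    # first-level weights (matrix "v")
    w = rng.standard_normal((20, 10))    # subsequent-level weights (matrix "w")

    y = g(x @ v)   # equation (1): y_j = g(sum_i x_i * v_ij), learned features
    z = g(y @ w)   # equation (2): z_j = g(sum_i y_i * w_ij), estimated scene features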

In this example, each processing node within the network (the state of each of multiple nodes in features 115-125) represents its activation as a numerical value. In other embodiments each node may have a binary activation state, a probability, a belief distribution, a discrete state on N possibilities, a point process over time, or any representation appropriate to the supervised learning algorithm employed. For example, the forward transformation may increase the probability of activation of a downstream node when an upstream node is activated. In another embodiment the forward transformation may add a weighted sum of a potentially unique kernel basis function from each upstream node to determine the belief distribution of the downstream node. In another embodiment the forward transformation may map rank-ordered discrete states of the upstream node to a monotonically increasing non-linear function “g”. In another embodiment the forward transformation may additively or multiplicatively increase or decrease the rate of the non-homogeneous point process instantiated in the downstream node. In another embodiment, the forward transformation additively or multiplicatively combines a potentially unique kernel with a matrix of the Markovian transition probabilities between all states of the downstream node.

Depending on the format that a particular system uses to represent the activation of a node, there may be different updates of the weights “v” at the first level 145, or the weights “w” at subsequent levels 150. Such updates may be performed multiple times and/or each update may include multiple weights.

In one embodiment the update rule may increase the gain or modify a parameterized shape of a kernel basis function which impacts the belief distribution of the downstream node. In another embodiment the update rule may modify the shape of the non-linear function “g”, for example by shifting its slope or center or skewing the mass of a cumulative distribution function that determines a monotonically increasing non-linear function. Such a modification of the function “g” may be applied to continuous-value, probabilistic, discrete or other activation states. In another embodiment, the update rule may increase or decrease the impact of the upstream node's activation on the rate of the non-homogeneous point process instantiated in the downstream node. In another embodiment, the update rule modifies gain, or another parameter of a kernel which, upon activation of the upstream node, is additively or multiplicatively combined with a matrix of the Markovian transition probabilities between all states of the downstream node.
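
As one hedged illustration of modifying the shape of “g”, the sketch below parameterizes a logistic non-linearity by a center and slope and takes a gradient step on those parameters (the gradient expressions follow from the logistic form; the error convention and step rule are assumptions, not the disclosed update):

    import numpy as np

    def g(a, center=0.0, slope=1.0):
        """Logistic non-linearity with an adjustable center and slope."""
        return 1.0 / (1.0 + np.exp(-slope * (a - center)))

    def update_g_params(a, center, slope, error, lr=0.01):
        """Shift the center and slope of "g" to reduce a scalar error
        signal (error = d(loss)/d(output) at input a)."""
        out = g(a, center, slope)
        dg_dcenter = -slope * out * (1.0 - out)       # d g / d center
        dg_dslope = (a - center) * out * (1.0 - out)  # d g / d slope
        center -= lr * error * dg_dcenter
        slope -= lr * error * dg_dslope
        return center, slope

    center, slope = update_g_params(a=0.7, center=0.0, slope=1.0, error=0.2)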

In some embodiments, the dimensionality of “v” and/or “w” is a matrix from each node in the upstream level to each node in the downstream level. In some embodiments, some of the entries in these matrices are zeros, and so a different representation may be used to achieve the same result. In some embodiments, there is a single value at each location in the matrix (also referred to as a “weight”). In other embodiments, there are multiple parameters at each location in the matrices “v” and “w”. For example, a gain parameter (which may also be referred to as a “weight”), and additional parameters that determine the shape of a kernel, such as the mean, variance and kurtosis of a generalized Gaussian distribution, may be used to update the state of a downstream node that represents a probability distribution. In some embodiments, such an update acts upon the weight or gain parameter. In other embodiments, the update rule may act upon other parameters, such as the mean, variance, and/or other parameters of a kernel distribution, or the location, slope, and/or other parameters of the non-linear activation function “g”.

The set of true labels 130, associated with a spatial map of Boolean values 135, may include available labels that are associated with the 3D scene 105. The spatial map of Boolean values 135 may provide a representation of various features that may be associated with the 3D scene 105.

The learning rule 140 may include a set of evaluation criteria that may be used to compare the estimated scene features 125 to the true scene features 130 in order to generate the parameter updates at the first level 145, and the parameter updates at the subsequent levels 150. The parameter updates 145-150 may be used to update the calculations used to generate the learned features 120 and the estimated scene features 125.

Using such an information flow, some embodiments create features for a first purpose (e.g., to estimate scene properties) and then a later stage exploits the same features for use in other visual tasks. Such “features” may include appearance features, learned intermediate features, lower-order estimated features, higher-order learned features, higher-order estimated features, etc. The features may be common to many cutting-edge front end visual processing systems. “Learned features” may include a hidden layer that has no correct answer and “estimated features” may correspond to nodes in a network that are configured to represent a particular scene feature. Learned features may be non-linear combinations of appearance features. For example, the learned features may be hidden units of a three-layer neural network. As another example, the learned features may be nonlinear transforms of data associated with a support vector machine.

Some embodiments may generate appearance features and learned features during a “learning” operation. Such features may then be used at runtime to generate estimated scene properties. Such appearance features, learned features, and/or estimated scene features may then be available at runtime to perform a future unknown visual evaluation task.

One of ordinary skill in the art will recognize that although system 100 has been described with reference to various specific elements and features, the flow diagram may be implemented in various other ways without departing from the spirit of the invention. For instance, different embodiments may use various different forward transformations that may be associated with various different systems, environments, tasks, etc.

FIG. 2 illustrates a flow chart of a conceptual process 200 used by some embodiments to implement a hybrid method that uses supervised and unsupervised learning. Such a process may begin, for instance, when an image is analyzed by a system of some embodiments.

As shown, the process may be used to construct an estimator of mid-level features, P(f1|ap, f2, f3 . . . fn), that estimates a spatial map of mid-level features (f1), given sparse appearance features (ap) and other mid-level features (f2, f3 . . . fn). The conditional dependencies of mid-level features may be provided by sampling from a generative model of scenes having 3D configurations with plausible objects, viewpoints and layouts. In some embodiments, the method may sample scenes from an existing computer graphics framework (e.g., a 3D animated movie, an online gaming environment, etc.). Operation 210 may require supplementary ground truth for mid-level features that are available for 3D scenes.

Process 200 may train (at 210) dependencies between appearance features and mid-level features (ap<-->f1 through ap<-->fn). The process may then train (at 220) dependencies among mid-level features (f1 to fn).

The process may then improve (at 230) the estimators for the mid-level features using supervised learning. One kind of labeled data may be left out each time. Operation 240 may be applied to each kind of mid-level representation (e.g., oriented occluding edges, albedo, surface normal, 3D motion, surface texture, etc.).

Next, the process may apply (at 240) fine tuning to the entire system and then end. Such fine tuning may include the dependencies learned from the appearance and the estimators of each of the features independently. This can be expressed as P(f1|ap, f2-hat, f3-hat . . . fn-hat), and may operate on pure image data, without requiring 3D labeled data, because each of the mid-level features is explicitly estimated.

Although process 200 has been described with reference to various details, one of ordinary skill in the art will recognize that the process may be implemented in various appropriate ways without departing from the spirit of the invention. For instance, the various process operations may be performed in different orders. In addition, one or more operations may be omitted and/or one or more other operations included. Furthermore, the process may be implemented as a set of sub-processes and/or as part of a larger macro process.

B. Estimating Higher-Order Features

FIG. 3 illustrates a conceptual schematic block diagram of an information system 300 that may use multiple feature types to estimate higher order features, according to an exemplary embodiment of the invention. Specifically, this figure shows the various elements and data pathways that may be used to implement the system 300. As shown, the system may include a 3D scene 305, a 2D image 310, appearance features 315, multiple sets of learned features 320-330, multiple sets of estimated scene features 335-345, and a set of higher-order estimated scene features 350.

The 3D scene 305, 2D image 310, and appearance features 315 may be similar to those described above in reference to FIG. 1. Each set of learned features 320-330 may be similar to the learned features 120 described above. Each set of estimated scene features 335-345 may be similar to the estimated scene features 125 described above. The higher-order estimated scene features 350 may be based at least partly on the appearance features 315, one or more sets of the learned features 320-330, and/or one or more sets of the estimated scene features 335-345.

The higher order scene features 350 may be generated using various appropriate algorithms. For instance, in some embodiments the higher order scene features use the same process as the estimated scene features, but they have access to other estimated scene features as input. In other embodiments, the higher order scene estimates use a different learning rule. Higher order scene features may also be a spatial array of predicted features, or a non-spatial attribute. Such higher order scene features may include, for instance, the locations of faces in an image, the linear velocity of egomotion, or the rotational velocity of egomotion.

Although system 300 has been described with reference to various specific details, one of ordinary skill in the art will recognize that the system may be implemented in various different ways without departing from the spirit of the invention. For instance, although the example system includes three sets of learned features and three associated sets of estimated scene features, different embodiments may have different numbers of sets of features that may be associated in various different ways.

FIG. 4 illustrates a schematic block diagram of an information system 400 that may use one or more sets of true scene features 410-420 to optimize higher order estimated scene features 350. System 400 may be substantially similar to the system 300 described above in reference to FIG. 3. In contrast to system 300, however, system 400 may have access to various sets of true scene features 410-420. Such true scene features may be available from the 3D scene information 305 (e.g., features of a virtual 3D environment may be available to compare to learned and/or estimated scene features).

The true scene features 410-420 may allow the estimation of higher order scene features 350 to be evaluated and/or improved. During the initial stages of training, the higher order features may be initially set to be equal to the weights between the true labels of another category and the learned category. At a later stage, the same weights will be used as a starting point, but rather than using the activity corresponding to the true features, the algorithm uses the activity of the estimated features. Such an approach maximizes the probability that the weights of the forward pass encode the true desired transform, and not a transformation from an arbitrary reoccurring biased estimate of a feature to the higher-order estimated feature.

Although the system 400 has been described with reference to various specific details, one of ordinary skill in the art will recognize that the system may be implemented in various different ways without departing from the spirit of the invention. For instance, although the example system includes two sets of true scene features, different embodiments may access different numbers of sets of true scene features (e.g., one set, three sets, ten sets, a hundred sets, a thousand sets, etc.).

FIG. 5 illustrates a schematic block diagram of an information system 500 that may include one or more sets of higher-order estimated scene features 510-530. System 500 may be substantially similar to the systems 300-400 described above in reference to FIGS. 3-4. In contrast to those systems, however, system 500 may include multiple distinct sets of higher-order estimated scene features 510-530. Although various communications pathways have been omitted for clarity, in this example, each set of estimated scene features (e.g., set 510) may depend on one or more sets of learned features 320-330, one or more sets of estimated scene features 335-345, and/or the set of appearance features 315.

In this example, the system 500 has not been optimized for any particular task, but is general purpose, able to be adapted to a variety of visual evaluation tasks. For example, such a task could be to detect forest fires, estimate the flow of traffic on a freeway, estimate the density of people in a crowd, track the swim path of a whale, estimate emotional state or cognitive alertness from a facial expression, estimate the ripeness of fruit, estimate the quality of manufacturing of a product, determine the location of a threaded hole for the placement of a bolt, evaluate a pipe for a leak, determine the health of an animal, assess the proper function of a mechanical system, or any other visual task that may be performed using a sufficiently high-speed, high-resolution camera having an appropriate vantage point.

Although system 500 has been described with reference to various specific details, one of ordinary skill in the art will recognize that the system may be implemented in various different ways without departing from the spirit of the invention. For instance, the number of nodes per feature type may be changed, affecting the spatial resolution of the learned features, estimated scene features, and higher-order estimated scene features. The rule for the presence of a feature may be changed, such as the radius R for inclusion. The number of feature types could vary at any level in the system: appearance features, learned features, estimated features, and higher order estimated features. A system may have multiple levels of learned features, each with a forward transfer to the next level. The forward transfer may vary from one system to the next. The update rule may vary from one system to the next. The parameter that is updated by the update rule may vary from one system to the next. The order of training of estimated features may vary from one system to the next, for example, whether the labeled features are trained in an interleaved fashion or in blocks, and the duration of the blocks.

C. Learning Algorithm

FIG. 6 illustrates a flow chart of a conceptual process 600 used by some embodiments to estimate a variety of scene features. The process may be implemented, for example, using one or more of the systems 100-500 described above. Process 600 may begin each time an image is made available for analysis. The process may be executed iteratively for a set of associated images (e.g., frames of a video). Alternately, the process may be executed on a queue of scenes and images that were selected based on some rule. For instance, image-scene pairs could be collected from, for example, a world state for every i^(th) player and n^(th) second from a virtual world, a generative model of objects, layouts, and/or viewpoints, random viewpoints from a client global positioning system (GPS) rendered in various appropriate ways, a view of an object from a 3D library with a spherical background and illumination map, etc.

Next, process 600 may label (at 610) the image with scene information. Each item of scene information may be calculated per pixel (and/or other appropriate delineations). The calculated scene information may include, for instance, whether the pixel is an occluding edge, whether an orientation is within a set of thresholds, whether a smoothed first, second, or third derivative of a surface is within a set of thresholds, whether a surface normal is within a set of thresholds, whether a coefficient (e.g., a Zernike coefficient) of a local surface is within a set of thresholds, whether an incident illumination is within a set of thresholds, a property of a surface texture label (e.g., whether the label is hair, brick, skin, fabric, etc.), a common type of element (e.g., a lip, eye, door, agent, power outlet, etc.), whether an effective color in white light is within a set of thresholds, etc. Generally, the scene information may include a function based on scene properties, camera positions, pixel location, etc.
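
As a sketch of two of these per-pixel labels, the following assumes ground-truth depth and unit-length surface-normal buffers are available from the renderer (NumPy; all thresholds and names are hypothetical):

    import numpy as np

    def dense_labels(depth, normals, edge_thresh=0.1,
                     normal_axis=(0.0, 0.0, 1.0), cos_thresh=0.9):
        """Per-pixel logical maps from ground-truth scene buffers:
        occluding edges from depth discontinuities, and whether the
        surface normal lies within an angular threshold of a target axis."""
        # Occluding edge: large jump in depth to a horizontal or vertical neighbor.
        edge = np.zeros_like(depth, dtype=bool)
        edge[:, 1:] |= np.abs(np.diff(depth, axis=1)) > edge_thresh
        edge[1:, :] |= np.abs(np.diff(depth, axis=0)) > edge_thresh
        # Surface normal within threshold: cosine against the target axis.
        axis = np.asarray(normal_axis)
        axis = axis / np.linalg.norm(axis)
        cos = np.tensordot(normals, axis, axes=([-1], [0]))
        near_axis = cos > cos_thresh
        return edge, near_axis

    depth = np.ones((64, 64)); depth[:, 32:] = 2.0       # a depth step
    normals = np.zeros((64, 64, 3)); normals[..., 2] = 1.0
    edge, near_axis = dense_labels(depth, normals)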

Process 600 may then determine (at 620) whether the image is appropriate to render and learn. Such a determination may be made in various appropriate ways. For instance, such a determination may be made at least partly based on an applied sufficiency function, with the image being rejected if the image (or portion of an image being analyzed) includes insufficient data (e.g., due to a lack of a positive example in view of the scene, attempting to analyze a region too close to the edge of the image, etc.). As another example, a positive feature example may be selected according to an evaluation function. Such an evaluation function may have a particular threshold (or set of thresholds) and may be chosen statically or dynamically (e.g., based on each feature to reflect a prior probability based on a logical operator and one or more data sources, where such relative probabilities of features may be maintained and/or reflected in updates).

In general, the training of frequently occurring features may take a long time, because there may be a large number of such features. Thus, some systems may update the entire system for only a subset of the total occurrences of a feature, but shift the magnitude of the update accordingly. The selection of the subset, both the identity and the fraction of the total set, may vary from one system to another, because any computational speedup may come at the cost of unnecessarily emphasizing spurious correlations.

When the process determines (at 620) that the image is not appropriate, the process may end (or may return to operation 610 after retrieving the next image in a set of images being analyzed). Alternatively, when the process determines (at 620) that the image is appropriate, the process then renders (at 630) the image. In some embodiments, the renderer may be selected dynamically (e.g., based on a learning state). For instance, in some embodiments, a fast renderer may be selected when the number of iterations is less than a pre-training threshold, and/or when a learning state exceeds a threshold. Otherwise, a slow renderer may be used.
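
A trivial sketch of this selection rule, with hypothetical threshold names:

    def choose_renderer(iteration, learning_state,
                        pretrain_iters=10_000, state_thresh=0.5):
        """Pick a fast renderer during pre-training or while learning is
        changing rapidly; otherwise use the slow, higher-fidelity one."""
        if iteration < pretrain_iters or learning_state > state_thresh:
            return "fast"
        return "slow"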

As an alternative to labeling and rendering an image as described above, process 600 may label images with scene attributes. Performing a large number of training samples may require an automated process. Such a process may be at least partially achieved by crowd-sourcing a labeling task rather than rendering the image.

Process 600 may then sample (at 640) the image. Such sampling may be performed multiple times, where a set of operations is performed for each sample. Each sample may be evaluated to determine, for instance, a center context over a selected pixel (defined as an “attended pixel”), defining first scene and image identifiers, and transforming the image by updating a 3D scene.

The samples may be used to identify, for instance, ego motion (e.g., moving forward, veering right, rotating head, etc.), agent motion (e.g., moving body, driving car, talking head, etc.), scale (zoom to or recede from approach), etc. In addition, the samples may be used to translate (e.g., a Gaussian jitter of view angle with pre-defined variance thresholds), apply an advanced physics engine (e.g., falling objects, rising smoke, etc.), etc.

The image may be rendered after the transform and defined using second scene and image identifiers. Some embodiments may store the samples in a database as labeled experiences that are available for batch learning.

Next, the process may evaluate (at 650) the image using a current visual algorithm. The evaluation may be performed using the visual algorithm on the first and second identified images. Results of the evaluation may include, for instance, a response of an estimator for the current feature, a response of the estimator for other features, a response based on appearance features, etc.

Process 600 may then select (at 660) learning modifiers. Various scenarios may occur depending on whether mid-level features at a location are the same or different between the first and second identified images. The location may include a region centered at each sampled pixel in the first identified image. The region may have a spatial tolerance equivalent to, for instance, a particular radius. Because each scenario includes two images, there are four possible image contexts in which the learning algorithm may update. The system may update based on the activation in response to the first image and the second image. Thus, a vector of eight values may be used to encode two changes for each image pair. One change may be based on a combination of the response to the first image and one value in the vector, and another change may be based on the combination of the response to the second image and another value in the vector.

Table 1 below presents an example comparison matrix between a first image (Image 1) and a second image (Image 2). The first image and the second image may, in some cases, be consecutive frames in a video. Alternately, the second image may be a frame that is located multiple frames (e.g., 3, 5, 10, 20, 100, 1000, etc.) after the first image. As another alternative, the combination of the response to the first image and the second image may be a combination of the average response to a first number of frames “N” and the average response to a second number of frames “M”, occurring a number of frames “T” later. As yet another alternative, the combination may be a weighted average of the response to N frames combined with a different weighted average of the response to M frames, separated by T frames.

TABLE 1

                     Image 1
                     POS          NEG
Image 2    POS       persist      appear
           NEG       disappear    absent

An update strength parameter may be calculated by multiplying a learning rate value by a matrix of values based on the results of the comparison illustrated by Table 1. Such a matrix, or contextual gain vector, may be represented as [persist1 persist2 appear1 appear2 disappear1 disappear2 absent1 absent2]. As one example, learning may be based purely on a label, where an example matrix may include values [1 1 −1 1 1 −1 −1 −1]. As another example, learning at a temporal boundary may be emphasized, where an example matrix may include values [1 1 −2 2 2 −2 −1 −1]. As yet another example, learning may only occur at temporal boundaries, where an example matrix may include values [0 0 −1 1 1 −1 0 0]. As still another example, learning may avoid temporal boundaries, where an example matrix may include values [1 1 0 0 0 0 −1 −1].

Persist, appear, disappear, and absent values may be set for a particular feature. In some cases, the values may be hand-designed by an expert for each selected dense label type. Many dense labels may be effectively learned using a small set of possible eight-long update vectors. In other cases, values may be selected from a list of commonly used contextual gain vectors. In other cases, a particular vector may be generated from first principles to achieve a certain ratio of emphasis on persisting features vs. fluctuating features, or to allow the weights to some features to slowly fade during long epochs of absent labels.

The matrix values may be any scalar number. The examples above were chosen for simplicity and to indicate larger values vs. smaller values, and where certain relationships are exactly balanced in magnitude but reversed in sign.

The learning rate value may have a larger magnitude initially, and then decrease as training progresses. Such an approach is similar to “simulated annealing.” The initial value of the learning rate may have a different characteristic scale depending on the parameter being updated.

Each node has a label that was either 0 or 1 for each image, and different actions should occur based on the values. One example update may be implemented as follows:

    "00": w = w + [response1 − mean(response1)] * learning_rate * absent1
                + [response2 − mean(response2)] * learning_rate * absent2
    "01": w = w + [response1 − mean(response1)] * learning_rate * appear1
                + [response2 − mean(response2)] * learning_rate * appear2
    "10": w = w + [response1 − mean(response1)] * learning_rate * disappear1
                + [response2 − mean(response2)] * learning_rate * disappear2
    "11": w = w + [response1 − mean(response1)] * learning_rate * persist1
                + [response2 − mean(response2)] * learning_rate * persist2

More generally, each new weight may be written as a function of the factors that affect the weight. The factors could be combined in various different ways. For instance, w = f(w, learning_rate, response1, response2, mean(response1), mean(response2), contextual_gain1, contextual_gain2). Labels may also be passed when appropriate, if they include real values (as opposed to the Boolean values described in reference to FIG. 1). For instance, w = f(w, learning_rate, response1, response2, label1, label2, contextual_gain1, contextual_gain2).

Process 600 may then determine (at 670) whether the selected learning modifiers are appropriate for the current image. Such a determination may depend on various appropriate factors, such as the resolution of the image, content of the image, color space of the image, etc. When the process determines (at 670) that the selected learning modifiers are not appropriate, the process may end. Alternatively, the process may update (at 680) various weights to be used in evaluating images.

The weights may be updated (at 680) in various appropriate ways based on various appropriate factors (e.g., whether using online learning or batch learning). The weights may be updated at either one or two locations (for instance, equations (1) and (2) described above). All algorithms may update the weights based at least partly on the learned features as related to estimated scene labels. This is effectively fitting a hyper-plane. Back-propagating the error in an artificial neural network (ANN) allows the system to update the weights of the layer below: the connection from appearance features to learned features. Other machine learning algorithms (e.g., adaptive boosting, reinforcement learning, genetic algorithms, etc.) may either use standard feature sets (e.g., for support vector machines) or may use random local features or the identity function of the previous level.
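
A hedged sketch of such an update at both locations follows, back-propagating a squared-error signal through the two levels of equations (1) and (2) (the logistic “g” and the cost function are assumptions for illustration, not the only choices the document contemplates):

    import numpy as np

    def g(a):
        return 1.0 / (1.0 + np.exp(-a))

    def backprop_update(x, target, v, w, lr=0.1):
        """One gradient step on both weight matrices of the two-level
        network of equations (1) and (2), under a squared-error cost."""
        y = g(x @ v)                        # learned features, equation (1)
        z = g(y @ w)                        # estimated scene features, equation (2)
        dz = (z - target) * z * (1 - z)     # error at the estimated-feature level
        dy = (dz @ w.T) * y * (1 - y)       # error back-propagated to the layer below
        w -= lr * np.outer(y, dz)           # update "w" at subsequent levels
        v -= lr * np.outer(x, dy)           # update "v" at the first level
        return v, w

    rng = np.random.default_rng(0)
    v = rng.standard_normal((50, 20))
    w = rng.standard_normal((20, 10))
    v, w = backprop_update(rng.standard_normal(50), np.zeros(10), v, w)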

Although process 600 has been described with reference to various details, one of ordinary skill in the art will recognize that the process may be implemented in various appropriate ways without departing from the spirit of the invention. For instance, the various process operations may be performed in different orders. In addition, one or more operations may be omitted and/or one or more other operations included. Furthermore, the process may be implemented as a set of sub-processes and/or as part of a larger macro process.

II. System Architecture

Sub-section II.A provides a conceptual description of a system architecture used by some embodiments to optimize local evaluation of image information. Sub-section II.B then describes an alternative system architecture that may optimize distributed evaluation of image information.

Some embodiments include a client device (e.g., a mobile phone, a camera, etc.) and a server device that may be accessible over one or more networks. During operation, the client device may send information to the server related to an image under evaluation. The server may send one or more task-specific “expert” modules to the client device for execution and/or execute such expert modules and return data to the client device.

In one example situation, a client device may capture one or more images of a flying bird. The server may identify expert modules that are related to things like flying birds (e.g., other flying objects, other moving animals, etc.). The expert modules may be dedicated to a current evaluation task (e.g., following the flight path of a bird and keeping the camera in focus) and may be based at least partly on image data (and/or other data) sent to the server.

A. Local Implementation

FIG. 7 illustrates a schematic block diagram of a conceptual system 700 used to implement some embodiments of the invention. Such a system may be implemented on a client device that has minimal interactions with a server device so as to provide fast response time. As shown, the system may include a set of accessible samples 705, a basic module 710, a server 715, a first action 720, an expert module 725, and/or a sustained action 730.

Each sample in the set of accessible samples 705 may be retrieved from an accessible database and/or other appropriate storage element. Such a sample may include visual information related to an image.

The basic module 710 may receive one or more samples for evaluation. Data related to each received sample may be sent to the server 715 for evaluation. The server may return information to the basic module of the client device (e.g., an answer, categories and associated confidences, specialist information, etc.). The server may thus implement a first action 720 based on the received sample(s). The server may supply one or more expert modules 725 to the client device based at least partly on the information received from the client device. Once sent to the client device, each expert module may operate on additional samples to provide sustained actions 730 based at least partly on the received samples 705.

Although system 700 has been described with reference to various specific details, one of ordinary skill in the art will recognize that the system may be implemented in various different ways without departing from the spirit of the invention. For instance, different embodiments may have different numbers of modules that may include various different communication pathways.

B. Distributed Implementation

FIG. 8 illustrates a schematic block diagram of an alternative conceptual system 800 used to implement some embodiments of the invention. As shown, the system may include a basic module 810 which may be implemented on a client device (e.g., a mobile phone, a camera, a PC, etc.) and a server 820 with access to sets of expert modules 830-840. In this example, the server executes the expert modules rather than sending the modules to the client for execution.

In some embodiments, only a sub-set 830 of the available expert modules is running at any given time, while another sub-set 840 may be unused in order to save processing power.

Although system 800 has been described with reference to various specific details, one of ordinary skill in the art will recognize that the system may be implemented in various different ways without departing from the spirit of the invention.

III. Methods of Operation

Sub-section III.A provides a conceptual description of the generation of object-invariant representations used by some embodiments. Sub-section III.B then describes multiple variable evaluation used by some embodiments. Next, sub-section III.C describes evaluation of sequential images performed by some embodiments. Sub-section III.D then describes prediction of subsequent image information provided by some embodiments. Next, sub-section III.E describes dense feature collection provided by some embodiments. Sub-section III.F then describes grouping by some embodiments of multiple features. Lastly, sub-section III.G describes grouping of transformations to predict subsequent image information in some embodiments.

A. Object Invariant Representation

FIG. 9 illustrates a side view of an object 910 with a first set of visual properties and another object 920 with a second set of visual properties. In this example, the objects 910-920 may be similarly-shaped objects of different sizes (or at different distances from a camera) and/or may be otherwise related. The objects are shown as examples only, and one of ordinary skill in the art will recognize that various differently shaped, sized, and/or otherwise differentiated objects may be evaluated in a similar fashion to that described below.

FIG. 10 illustrates a flow chart of a conceptual process 1000 used by some embodiments to provide object invariant representations of objects. Process 1000 will be described with reference to FIG. 9. Process 1000 may begin, for instance, when a scene is being evaluated by some embodiments.

The process may retrieve (at 1010) a set of samples of the scene. Each such sample may include a set of pixels included in an image associated with the scene under evaluation. The sets of pixels may be of varying size and shape. Alternatively, each sample may include the same size and shape of sets of pixels such that the samples may be compared to similar other samples and/or evaluation criteria.

The process may then determine (at 1020) task-independent similarity of two or more samples. Next, the process may determine (at 1030) similarity of two or more samples for a specific categorization task. The process may then end.

Such determinations may be based at least partly on various visual features associated with images under evaluation. For instance, the object 910 may have a similar ratio of space to shadow as the object 920. As another example, the two objects 910-920 may have a similar curvature along a particular edge, a similar ratio of height to width, and/or other similarities that may associate the objects.
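A minimal sketch of the two determinations of operations 1020 and 1030, assuming fixed-size pixel samples: normalized cross-correlation serves as a task-independent measure, while a comparison of a few hand-picked shape ratios (e.g., height to width) stands in for a task-specific measure. The feature choices and patch sizes are illustrative assumptions, not prescribed by the process.

    # Hypothetical sketch of operations 1020 and 1030 of process 1000.
    import numpy as np

    def task_independent_similarity(a, b):
        """Normalized correlation of two equally sized pixel patches."""
        a = (a - a.mean()) / (a.std() + 1e-8)
        b = (b - b.mean()) / (b.std() + 1e-8)
        return float((a * b).mean())

    def task_specific_similarity(feats_a, feats_b):
        """Compare per-object ratios (e.g., height/width, space/shadow);
        a smaller distance maps to a higher similarity score."""
        d = np.abs(np.asarray(feats_a) - np.asarray(feats_b))
        return float(np.exp(-d.sum()))

    patch_910 = np.random.rand(16, 16)   # sample taken from object 910
    patch_920 = np.random.rand(16, 16)   # sample taken from object 920
    print(task_independent_similarity(patch_910, patch_920))
    print(task_specific_similarity([1.8, 0.4], [1.7, 0.5]))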

Although process 1000 has been described with reference to various details, one of ordinary skill in the art will recognize that the process may be implemented in various appropriate ways without departing from the spirit of the invention. For instance, the various process operations may be performed in different orders. In addition, one or more operations may be omitted and/or one or more other operations included. Furthermore, the process may be implemented as a set of sub-processes and/or as part of a larger macro process.

B. Multiple Variable Evaluation

FIG. 11 illustrates a flow chart of a conceptual process 1100 used by some embodiments to evaluate multiple variables. Such a process may begin, for instance, when an image is evaluated by some embodiments. The process may be used to identify associated sections of visual information (e.g., visual information associated with a banana viewed during the day and at night).

The process may receive (at 1110) a pattern of variation across a section of the image. The process may then retrieve (at 1120) historical variation in a measured variable. Such a measured variable may decline exponentially over time (e.g., a magnitude associated with the variable may rise rapidly and then decay over time following an exponential decay path). The process may then transform (at 1130) the received pattern to generate a second pattern based on a first variable with a fast timescale and a second variable with a slow timescale (e.g., a variable with exponential decay).
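The fast/slow decomposition of operation 1130 can be sketched with two leaky accumulators whose time constants differ, the slower one decaying exponentially as described above. The time constants and the step input below are arbitrary example values, not parameters of the process.

    # Hypothetical sketch of operation 1130 of process 1100.
    import numpy as np

    def transform_pattern(pattern, tau_fast=1.0, tau_slow=20.0):
        """Build a second pattern from a fast variable that tracks the
        input and a slow variable that decays exponentially."""
        fast = np.zeros_like(pattern)
        slow = np.zeros_like(pattern)
        f = s = 0.0
        for t, x in enumerate(pattern):
            f += (x - f) / tau_fast          # fast timescale
            s += (x - s) / tau_slow          # slow, exponentially decaying
            fast[t], slow[t] = f, s
        return fast - slow                   # combined second pattern

    signal = np.r_[np.zeros(5), np.ones(20), np.zeros(25)]  # example input
    print(np.round(transform_pattern(signal), 2))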

Although process 1100 has been described with reference to various details, one of ordinary skill in the art will recognize that the process may be implemented in various appropriate ways without departing from the spirit of the invention. For instance, the various process operations may be performed in different orders. In addition, one or more operations may be omitted and/or one or more other operations included. Furthermore, the process may be implemented as a set of sub-processes and/or as part of a larger macro process.

C. Sequential Image Evaluation

FIG. 12 illustrates a side view 1200 and a top view 1250 of an example object layout within a scene, an x-y plot of a cross section of a depth image 1260, an x-y plot of a cross section of an estimate of the probability of an occluding edge 1270, and a 3D plot of a timeline 1280. As shown, the side view 1200 includes a camera 1210, and several objects 1220-1240 arranged within the view 1200. The top view 1250 includes the camera 1210 and objects 1220-1240 as seen from the alternative viewpoint.

The x-y plot of a cross section of a depth image 1260 indicates the relative depths of the objects 1220-1240 as rising edges along the depth axis, where the x axis represents a horizontal position along the views 1200 and 1250. The x-y plot of a cross section of an estimate of the probability of an occluding edge 1270 indicates the estimated probability of occlusion on the vertical axis and the horizontal position along the views 1200 and 1250 along the x axis.

The 3D plot of a timeline 1280 indicates a short video 1290 being recorded and an image 1295 being taken. Such a short video may allow for analysis of images taken from multiple viewpoints (i.e., with a moving camera). Alternatively and/or conjunctively, the short video may allow analysis of movement of objects with a fixed camera position (e.g., showing coherent motion such as an object travelling at a constant velocity, jitter around one or more objects, etc.).

The parallax in the scene will result in some of the background being occluded on one side of a foreground object, and revealed on the other. Typically this is on the left or right side due to horizontal translation of the relative position of the camera and the foreground object, but it may be caused by relative motion in any direction. Object rotations may also cause the appearance and disappearance of visual features.

A generalization of the appearance or disappearance of visual features is a change from one pattern to the next, after accounting for global distortions of translation, expansion, rotation, shear, or other geometric transformations that may be caused by egomotion or other global movements. This pattern change is indicative of a moving boundary of an object, and hence provides a probabilistic cue of an edge of an object. Alternatively or additionally, the changes caused by occlusion may be detected by violations in the conservation of some resource that is typically conserved, such as luminance.

Generally, the pattern change is the more reliable cue, but in some cases unaccounted-for luminance changes may be enough. In one method, the probability of an object edge could be modeled as a Boltzmann distribution where the energy is set to a score of the pattern change, the luminance change, or some other change. After accounting for egomotion or other global motion, the Boltzmann constant may be determined by the magnitude of the scoring process, and the temperature may be determined by the context. The score, the probability, or a threshold on either could be used as the dense label. Before passing on the labels for learning, some algorithms may benefit from a de-noising process that exploits the prior probability of the continuity of object boundaries.
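One way to read the Boltzmann formulation above is as a two-state model (edge versus no edge) in which a larger pattern-change score corresponds to a lower energy for the edge state. The sketch below is a minimal illustration under that reading; the constant k and temperature T are placeholders that, per the description, would be set from the magnitude of the scoring process and from context.

    # Hypothetical sketch of the Boltzmann edge-probability model.
    import numpy as np

    def edge_probability(change_score, k=1.0, T=1.0):
        """Two-state Boltzmann model: p(edge) rises toward 1 as the
        pattern-change (or luminance-change) score grows."""
        energy = -change_score               # larger change = lower energy
        return 1.0 / (1.0 + np.exp(energy / (k * T)))

    scores = np.array([0.1, 0.5, 2.0, 5.0])       # example per-location scores
    print(np.round(edge_probability(scores), 3))  # usable as dense labels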

The general strategy is to create an engineered solution for extracting a particular label, and to provide this as a training signal that acts on the raw data. The system may be able to perform the same operation more efficiently. Also, the signal will be more compatible across domains, allowing the system to combine the signal with many other learned dense labels for general-purpose tasks. Compared to a 3D world, a processed label from a camera may be less reliable, but the image data is more realistic. Also, images from cameras are likely to be easy to tailor to a particular problem domain (which may not necessarily provide inputs in a 3D world), including the lighting, noise level, resolution, auto-focusing, auto-white-balancing, or other camera settings that are appropriate for the problem domain.

Although the example of FIG. 12 has been described with reference to various details, one of ordinary skill in the art will recognize that different specific examples may include different numbers of objects, different layouts of objects, different depth and/or probability calculations, etc.

FIG. 13 illustrates a flow chart of a conceptual process 1300 used by some embodiments to append dense labels to an image. Process 1300 will be described with reference to the example of FIG. 12.

As shown, the process may record (at 1310) a short video (e.g., video 1290) that includes a sequence of images. Next, the process may capture (at 1320) an image. Such an image may be captured in various appropriate ways.

The process may then evaluate (at 1330) the recorded video to determine knowledge that is not included in the captured image. Such knowledge may include, for instance, depth of objects 1260, probability of occlusion 1270, etc. Thus, for example, although an object in the captured image may be represented at a fixed position with no movement, movement of the object may be detected by analyzing the video.

Next, process 1300 may encode (at 1340) any determined knowledge into the captured image file using dense labels. Such encoding may utilize a structure similar to the image file representation (e.g., RGB) that may be transparent to external systems but include information determined by examining the recorded video, for instance. Such encoded information may be used in various appropriate ways (e.g., to predict the path of an object, to determine relative positions of a set of objects, etc.).
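A minimal sketch of the encoding of operation 1340, assuming the determined knowledge is stored as extra per-pixel planes alongside the RGB data so that the first three channels remain readable by external systems. The channel layout and array-based representation are illustrative assumptions rather than a prescribed file format.

    # Hypothetical sketch of operation 1340 of process 1300.
    import numpy as np

    def append_dense_labels(rgb, depth, occlusion_prob):
        """Stack an H x W x 3 RGB image with two dense-label planes
        (depth, occlusion probability) into an H x W x 5 array."""
        labels = np.stack([depth, occlusion_prob], axis=-1)
        return np.concatenate([rgb, labels], axis=-1)

    h, w = 4, 4
    rgb = np.zeros((h, w, 3), dtype=np.float32)      # captured image 1295
    depth = np.ones((h, w), dtype=np.float32)        # knowledge from video 1290
    occ = np.full((h, w), 0.25, dtype=np.float32)    # occlusion estimate 1270
    print(append_dense_labels(rgb, depth, occ).shape)  # (4, 4, 5)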

Although process 1300 has been described with reference to various details, one of ordinary skill in the art will recognize that the process may be implemented in various appropriate ways without departing from the spirit of the invention. For instance, the various process operations may be performed in different orders. In addition, one or more operations may be omitted and/or one or more other operations included. Furthermore, the process may be implemented as a set of sub-processes and/or as part of a larger macro process.

D. Prediction of Subsequent Image Information

FIG. 14 illustrates a flow chart of a conceptual process 1400 used by some embodiments to estimate the joint probability of features and transforms. Such a process may begin, for instance, when a set of images is made available for evaluation. As shown, the process may determine (at 1410) a feature response for a first image. Next, the process may determine (at 1420) a feature response for a second image. The process may then determine (at 1430) a transform based at least partly on the feature responses of the first and second image. Next, process 1400 may determine (at 1440) mid-level features that are invariant to the transform. The process may then determine (at 1450) a space of encoded activity based on the transform. Lastly, process 1400 may represent (at 1460) the joint probability of the transform and the space and then end.

As one example, some embodiments may analyze a set of features to determine that a sequence of images includes a rotating wheel. The transform may then be based at least partly on the speed of the wheel, while the space may include units that will represent a future image (if the speed is maintained).

To continue this example, the first image may induce a feature to strongly respond to a grommet on the edge of the rotating wheel. In the response to the second image, the grommet will induce a new feature to respond in a different location. In this case, a single grommet was displaced, and this may be consistent with a translation or a rotation. So after the second image, there remains some ambiguity. However, if multiple grommets respond, the system may find that a transformation consistent with the pair of responses is a rotation about the axis of the wheel.

Such transforms are presumed to last for longer than two consecutive frames, and so the evidence gained from previous frames can be integrated to better predict the current transform, and thus better predict the next location of the grommet on the wheel. Thus, even if there was some ambiguity between a rotation and a translation at the second frame, at the next sample the transform estimation may combine its current state, which contains historical information, with the new evidence.

Additionally, knowledge of the location of the feature (or collection of features) may help to identify that it is, in fact, the same grommet that is moving around, as opposed to grommets randomly appearing and disappearing.

Since the system is not perfect, and may make mistakes, it may not produce a confident answer if the evidence is weak. It will, however, still indicate that some states are more likely than others by representing a joint probability density across features, locations, and transforms. In some embodiments, these probabilities may be represented independently, but often it is desirable for at least locations and transforms to be represented jointly.
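The integration of evidence across frames described above can be sketched as a recursive Bayesian update over a small set of candidate transforms. The two candidates and the likelihood values below are invented for the example; in practice the belief could extend jointly over features, locations, and transforms.

    # Hypothetical sketch of evidence integration in process 1400.
    import numpy as np

    transforms = ["rotation", "translation"]
    belief = np.array([0.5, 0.5])            # prior over candidate transforms

    def update(belief, likelihood):
        """Combine the historical state with new evidence (Bayes rule)."""
        posterior = belief * likelihood
        return posterior / posterior.sum()

    # Second image: one displaced grommet is nearly ambiguous.
    belief = update(belief, np.array([0.55, 0.45]))
    # Third image: multiple grommets respond consistently with rotation.
    belief = update(belief, np.array([0.9, 0.1]))
    print(dict(zip(transforms, np.round(belief, 3))))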

Although process 1400 has been described with reference to various details, one of ordinary skill in the art will recognize that the process may be implemented in various appropriate ways without departing from the spirit of the invention. For instance, the various process operations may be performed in different orders. In addition, one or more operations may be omitted and/or one or more other operations included. Furthermore, the process may be implemented as a set of sub-processes and/or as part of a larger macro process.

E. Dense Feature Collection

FIG. 15 illustrates a sequence of images 1500 used by some embodiments to estimate occluding edges of a transforming object, and an x-y plot of occlusion error over time. In this example, the first image 1505 includes an object 1510 (or feature) at a particular position. The second image 1515 includes the object at a different location 1520. The third image 1525 includes a shaded area that represents the background of the first and second images 1505 and 1515. Such a background section may be identified using a low-pass filter to identify regionally coherent motion and/or in other appropriate ways.

The fourth image 1530 represents an occlusion error as shaded area 1535. The error may be calculated based on the first and second images 1505 and 1515, and may be calculated using a high-pass filter, for example. The x-y plot 1550 represents occlusion error over time and indicates a peak in relative velocity 1555, a propagation delay 1560, and an indication 1565 of when a photo is taken. The peak in the magnitude of errors induced by occlusion may be used to select a set of frames in the preceding history (e.g., sequence 1290) which may be processed to return the most informative dense labels, which may then be treated as scene features during operation of a system (e.g., system 100).

FIG. 16 illustrates a flow chart of a conceptual process 1600 used by some embodiments to predict image properties using sequences of images. Process 1600 will be described with reference to FIG. 15. Process 1600 may begin, for instance, when a scene is being evaluated by some embodiments.

The process may receive (at 1610) a first image (e.g., image 1505). Next, the process may receive (at 1620) a successor image (e.g., image 1515). The process may then align (at 1630) the successor image to the first image using, for example, cross-correlation (and/or other appropriate ways, such as minimizing the difference with a smooth distortion map, or finding the affine transformation that best fits reliable key points). Next, process 1600 may calculate (at 1640) a difference between the first image and the successor image.

Process 1600 may then calculate (at 1650) one or more edge discontinuities in the motion field determined by the aligned images. Process 1600 may employ low-pass (e.g., image 1525) and/or high-pass filters (e.g., image 1530). Edge discontinuity in the motion field may be calculated using a horizontal motion field that is filtered with a high-pass filter. Alternatively, the direction of the motion flow field need not be restricted to horizontal, and/or the filter may be a band-pass or low-pass filter. The spatial derivative may be calculated in a manner optimized for a particular spatial scale. In other cases, a Bayesian method may be used, for example with a stick-breaking prior. The process may then predict (at 1660) image properties based on the calculated discontinuities.
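By way of illustration, the sketch below aligns two one-dimensional image rows by exhaustive cross-correlation (operation 1630) and then applies a simple high-pass filter (a discrete derivative) to a motion field to expose edge discontinuities (operation 1650). The signals, shift range, and filter are example assumptions; as noted above, other alignments and filters may be used.

    # Hypothetical sketch of operations 1630 and 1650 of process 1600.
    import numpy as np

    def best_shift(a, b, max_shift=5):
        """Return the circular shift of b that best aligns it with a,
        chosen by maximizing the cross-correlation score."""
        shifts = list(range(-max_shift, max_shift + 1))
        scores = [np.dot(a, np.roll(b, s)) for s in shifts]
        return shifts[int(np.argmax(scores))]

    def motion_discontinuities(motion_field):
        """High-pass filter (discrete derivative) of the motion field;
        large values mark candidate occluding edges."""
        return np.abs(np.diff(motion_field))

    row1 = np.r_[np.zeros(10), np.ones(5), np.zeros(10)]   # first image row
    row2 = np.roll(row1, 2)                                # object moved 2 px
    print(best_shift(row1, row2))                          # recovered shift: -2
    motion = np.r_[np.zeros(12), 2.0 * np.ones(5), np.zeros(8)]
    print(motion_discontinuities(motion))                  # peaks at the edges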

Although process 1600 has been described with reference to various details, one of ordinary skill in the art will recognize that the process may be implemented in various appropriate ways without departing from the spirit of the invention. For instance, the various process operations may be performed in different orders. In addition, one or more operations may be omitted and/or one or more other operations included. Furthermore, the process may be implemented as a set of sub-processes and/or as part of a larger macro process.

Some embodiments may utilize the image information in various different ways. For instance, when a moving object is identified, dense labels may be used to indicate the speed of movement, direction of movement, etc. Such dense labels may be embedded into image data such that the labels are transparent to external systems but are able to be read and utilized by some embodiments.

F. Feature Association

FIG. 17 illustrates a conceptual process 1700 used by some embodiments to group features. Such a process may begin, for instance, when a set of associated images is made available. As shown, the process may retrieve (at 1710) a video sequence that includes images. Next, the process may process (at 1720) the image data using multiple feature types (such processing and feature types may include filtering the image data using different filter types). The process may then compare (at 1730) the feature outputs (e.g., the filter outputs). Next, the process may identify (at 1740) similarities in the feature outputs. Such similarities may be identified in various appropriate ways using various appropriate algorithms and parameters (e.g., by detecting significant coherence in the phase of the spectral power within a temporal envelope that matches the activity shared across a subset of nodes).

Process 1700 may then use (at 1750) the similarities to group sets of associated features. Such features may be grouped based on various appropriate similarity criteria. The process may then store (at 1760) the generated sets of associated features such that the sets of associated features may be applied to future predictions.
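A minimal sketch of the grouping idea, assuming that "feature types" are simple functions of the frames and that similarity is measured by correlating their outputs over time; the filters, threshold, and data are invented for illustration and differ from the phase-coherence example given above.

    # Hypothetical sketch of operations 1720-1750 of process 1700.
    import numpy as np

    rng = np.random.default_rng(0)
    frames = rng.random((50, 8))              # 50 frames, 8 raw inputs each

    f1 = frames.mean(axis=1)                  # feature type 1
    f2 = frames.max(axis=1)                   # feature type 2
    f3 = f1 + 0.01 * rng.standard_normal(50)  # feature type 3 (tracks f1)
    outputs = np.stack([f1, f2, f3])

    corr = np.corrcoef(outputs)               # compare feature outputs
    groups = [(i, j) for i in range(3) for j in range(i + 1, 3)
              if corr[i, j] > 0.9]            # group strongly similar pairs
    print(groups)                             # [(0, 2)]: f1 groups with f3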

Although process 1700 has been described with reference to various details, one of ordinary skill in the art will recognize that the process may be implemented in various appropriate ways without departing from the spirit of the invention. For instance, the various process operations may be performed in different orders. In addition, one or more operations may be omitted and/or one or more other operations included. Furthermore, the process may be implemented as a set of sub-processes and/or as part of a larger macro process.

G. Transformation Association

FIG. 18 illustrates a flow chart of a conceptual process 1800 used by some embodiments to predict and apply future transformations. Such a process may begin, for instance, when an image is made available for analysis. As shown, the process may receive (at 1810) an image space (and an associated transform). The process may then determine (at 1820) transformations of the image space. Next, the process may generate (at 1830) a higher order grouping of transformations. A transformation may then be predicted (at 1840) based at least partly on the higher order grouping. Process 1800 may then predict (at 1850) a future space based at least partly on the predicted transformation.

Process 1800 may be applied in various appropriate ways. For instance, some embodiments may determine the acceleration of an object and use the determined acceleration to predict the velocity of the object. In another example, a person may be walking and begin to turn right, which may allow a prediction that the person will continue to turn right (at least for some expected time or distance). As another example, some embodiments may allow prediction of the flight path of a bird, with the ability to recognize the different expectations regarding a bird that is flying generally horizontally at a steady pace and a bird that is starting to dive.
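The acceleration example can be made concrete with a sketch that treats frame-to-frame velocities as first-order transformations and their common change as a higher order grouping, which is then used to predict the next transformation and the next position. The observed path below is invented for the example.

    # Hypothetical sketch of process 1800 for a 1-D accelerating object.
    import numpy as np

    positions = np.array([0.0, 1.0, 3.0, 6.0])     # observed image spaces
    velocities = np.diff(positions)                # transformations (at 1820)
    acceleration = np.diff(velocities).mean()      # higher order grouping (at 1830)

    next_velocity = velocities[-1] + acceleration  # predicted transform (at 1840)
    next_position = positions[-1] + next_velocity  # predicted space (at 1850)
    print(next_velocity, next_position)            # 4.0 10.0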

Although process 1800 has been described with reference to various details, one of ordinary skill in the art will recognize that the process may be implemented in various appropriate ways without departing from the spirit of the invention. For instance, the various process operations may be performed in different orders. In addition, one or more operations may be omitted and/or one or more other operations included. Furthermore, the process may be implemented as a set of sub-processes and/or as part of a larger macro process.

IV. Cost-Based Feature Analysis

FIG. 19 illustrates training processes 1900 and 1910 used by some embodiments and two example configurations 1920 and 1940 for combining supervised and unsupervised learning. Training process 1900 illustrates unsupervised learning across multiple levels (e.g., low, mid, high). Such unsupervised learning may include a process whereby the state at the base of the arrow 1905 may impact the weight updates at the levels along the direction of the arrow (e.g., the activity at the low level may drive the mid-level, thus affecting the weights from low level to mid-level).

Training process 1910 illustrates supervised learning across multiple levels (e.g., low, mid, mid-estimated). Such supervised learning may include a process whereby the errors from the estimated features are used to assign blame to the nodes that most impacted the errors. Such blame may proportionally affect the magnitude of the update to each weight, according to a back propagation learning algorithm implemented along the direction of arrow 1915. Such a scheme may be used to update multiple preceding levels.

The first example hybrid configuration 1920 illustrates combined supervised and unsupervised learning across multiple levels (e.g., from top to bottom: low, mid, mid estimate, high, and high estimate). The low level features may be learned or designed, using state of the art front-end features. The mid-level features may be learned unsupervised from the low level (i.e., from the “bottom up”) initially, before propagating learning down on each pass. The mid estimate may use supervised top-down learning only; the high level may be generated using unsupervised bottom-up learning initially, and then also on each down pass; and the high estimate may be based on supervised top-down learning only. Additional levels 1925 of unsupervised learning may be added, and additional levels of supervised learning 1930 may also be added. In addition, supervised learning 1935 may affect multiple preceding layers. Such an approach may be particularly desirable for fine tuning at the last stage, when a system is to be deployed for a particular task, or set of tasks, and the cost function of the task can be applied to impact the back propagation.

The second example hybrid configuration 1940 illustrates an alternative combination of supervised and unsupervised learning across multiple levels (e.g., from top to bottom: low (e.g., appearance features 315), mid (e.g., learned features 320-330), estimated mid (e.g., estimated features 335-345), high, a mixing hidden layer (variable topologies may be effective for different problems), a second estimated layer, a second mixing hidden layer, and task specific decision units).

Some embodiments may perform unsupervised learning 1945 for each set of learned features (these layers may learn different weights, for example if a ratio of density to number of nodes differs). In some cases the weights may be the same across multiple learned features during initial training, but then diverge later. Some embodiments may perform supervised learning 1950 which back propagates costs. Unsupervised learning 1955 may be performed within each learned feature, and from learned features to the mixing hidden layer. Supervised learning 1960 may back propagate costs. Unsupervised learning 1965 may proceed directly from the estimated feature to the second mixing hidden layer. Supervised learning 1970 may then back propagate costs (only one layer in this example). Supervised learning 1975 may back propagate costs across the whole system.

Additional levels of unsupervised learning 1955 and 1965 may be added. Additional levels of supervised learning 1960 and 1970 may be added. Supervised learning 1975 may affect multiple (or all) preceding layers.

Mixing layers may be used to integrate the results of previous levels of many types. Allowing a level before and after each estimated level is a valuable design pattern that allows for rich mappings between one estimated feature level and the next.

FIG. 20 illustrates a flow chart of a conceptual process 2000 used by some embodiments to train both supervised and unsupervised levels in a system of some embodiments. Such a process may begin, for instance, when a set of features is being trained. As shown, the process may learn (at 2010) mid-level features from low-level features. Such learning may be performed in various appropriate ways (e.g., using correlation-based unsupervised learning such as Oja's rule).
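Oja's rule is a standard correlation-based rule, so a faithful one-unit sketch is possible; the input dimensionality, learning rate, and data stream below are arbitrary example choices for operation 2010.

    # Sketch of one unsupervised unit trained with Oja's rule (at 2010).
    import numpy as np

    rng = np.random.default_rng(1)
    w = rng.standard_normal(4) * 0.1            # low-to-mid level weights

    def oja_step(w, x, lr=0.01):
        """Oja's rule: dw = lr * y * (x - y * w); Hebbian learning with
        implicit weight normalization."""
        y = float(w @ x)                        # mid-level feature response
        return w + lr * y * (x - y * w)

    for _ in range(1000):                       # stream of low-level samples
        x = rng.standard_normal(4) * np.array([2.0, 1.0, 0.5, 0.1])
        w = oja_step(w, x)
    print(np.round(w, 2))   # w aligns with the dominant input direction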

Next, process 2000 may learn (at 2020) estimated features using a cost function. Such learning may include, for each image: performing a forward pass from low level to mid-level; performing a forward pass from mid-level to estimated mid-level features; determining an error associated with the estimate using ground truth measurements; determining a cost of the error; and propagating the error back down the chain, from mid-level to low level, to update mid-level to estimated mid-level proportionally and, optionally, to update low level to mid-level proportionally.

The process may then learn (at 2030) high level features from mid-level estimated features. Such learning may include, for each image, performing a forward pass from low level to mid-level, performing a forward pass from mid-level to estimated features, performing a forward pass from estimated mid-level features to high-level, and applying correlation-based unsupervised learning.

Next, the process may learn (at 2040) high level estimated features using a cost function. Such learning may involve, for each image, performing a forward pass from low level to mid-level, performing a forward pass from mid-level to estimated features, performing a forward pass from estimated features to high-level, and performing a forward pass from high level to estimated high level. An error associated with the estimate may be determined using ground truth measures. The cost of the error may be determined and propagated back down the chain by updating high-level to estimated high level proportionally and, optionally, updating estimated mid-level to high-level proportionally, updating mid-level to estimated mid-level proportionally, and updating low level to mid-level proportionally.
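The supervised steps at 2020 and 2040 share a pattern: a forward pass, an error against ground truth, and a proportional backward update of one or more preceding levels. The sketch below follows that pattern for a single mid-level and estimated level with a squared-error cost; the shapes, nonlinearity, and synthetic "ground truth" are assumptions made for the example.

    # Hypothetical sketch of the supervised pattern of operations 2020/2040.
    import numpy as np

    rng = np.random.default_rng(2)
    W_mid = rng.standard_normal((3, 4)) * 0.1   # low -> mid weights
    W_est = rng.standard_normal((2, 3)) * 0.1   # mid -> estimated-mid weights

    def train_step(x, truth, lr=0.05):
        global W_mid, W_est
        mid = np.tanh(W_mid @ x)                # forward pass: low -> mid
        est = W_est @ mid                       # forward pass: mid -> estimate
        err = est - truth                       # error vs. ground truth labels
        grad_mid = (W_est.T @ err) * (1 - mid**2)
        W_est -= lr * np.outer(err, mid)        # update mid -> estimate
        W_mid -= lr * np.outer(grad_mid, x)     # optional lower-level update
        return float((err ** 2).sum())          # cost of the error

    cost = 0.0
    for _ in range(200):
        x = rng.standard_normal(4)
        truth = np.array([x[:2].sum(), x[2:].sum()])  # stand-in scene labels
        cost = train_step(x, truth)
    print(round(cost, 4))                       # cost shrinks over training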

Process 2000 may then determine (at 2050) whether all features have been evaluated for a category. If the process determines (at 2050) that not all features have been evaluated, the process may perform operations 2010-2050 until the process determines (at 2050) that all features have been evaluated for the category, at which point the process may determine (at 2060) whether all levels have been trained. When the process determines (at 2060) that not all levels have been trained, the process may proceed (at 2070) to the next level and repeat operations 2010-2060 until the process determines (at 2060) that all levels have been trained.

When the process determines (at 2060) that all levels have been trained, the process may then fine tune (at 2080) all levels using a task performance cost function (where such a cost function may integrate multiple tasks).

In some embodiments, process 2000 may allow alternating among supervised and unsupervised levels. Each level may be learned sequentially, where unsupervised levels do not require dense labels, supervised levels use dense labels, supervised learning always impacts the weights from the level directly below, and supervised learning optionally impacts other levels via back propagation. Most learning is task independent. The last stage of learning may involve learning that back-propagates a cost sensitive error that integrates over multiple desired tasks.

Although process 2000 has been described with reference to various details, one of ordinary skill in the art will recognize that the process may be implemented in various appropriate ways without departing from the spirit of the invention. For instance, the various process operations may be performed in different orders. In addition, one or more operations may be omitted and/or one or more other operations included. Furthermore, the process may be implemented as a set of sub-processes and/or as part of a larger macro process.

V. Computer System

Many of the processes and modules described above may be implemented as software processes that are specified as at least one set of instructions recorded on a non-transitory storage medium. When these instructions are executed by one or more computational element(s) (e.g., microprocessors, microcontrollers, Digital Signal Processors (“DSP”), Application-Specific ICs (“ASIC”), Field Programmable Gate Arrays (“FPGA”), etc.), the instructions cause the computational element(s) to perform actions specified in the instructions.

FIG. 21 conceptually illustrates a schematic block diagram of a computer system 2100 with which some embodiments of the invention may be implemented. For example, the systems described above in reference to FIGS. 7-8 may be at least partially implemented using computer system 2100. As another example, the processes described in reference to FIGS. 6, 10, 11, 13-15, and 17-20 may be at least partially implemented using sets of instructions that are executed using computer system 2100.

Computer system 2100 may be implemented using various appropriate devices. For instance, the computer system may be implemented using one or more personal computers (“PC”), servers, mobile devices (e.g., a Smartphone), tablet devices, cameras, and/or any other appropriate devices. The various devices may work alone (e.g., the computer system may be implemented as a single PC) or in conjunction (e.g., some components of the computer system may be provided by a mobile device while other components are provided by a tablet device).

Computer system 2100 may include a bus 2105, at least one processing element 2110, a system memory 2115, a read-only memory (“ROM”) 2120, other components (e.g., a graphics processing unit) 2125, input devices 2130, output devices 2135, permanent storage devices 2140, and/or network interfaces 2145. The components of computer system 2100 may be electronic devices that automatically perform operations based on digital and/or analog input signals.

Bus 2105 represents all communication pathways among the elements of computer system 2100. Such pathways may include wired, wireless, optical, and/or other appropriate communication pathways. For example, input devices 2130 and/or output devices 2135 may be coupled to the system 2100 using a wireless connection protocol or system. The processor 2110 may, in order to execute the processes of some embodiments, retrieve instructions to execute and data to process from components such as system memory 2115, ROM 2120, and permanent storage device 2140. Such instructions and data may be passed over bus 2105.

ROM 2120 may store static data and instructions that may be used by processor 2110 and/or other elements of the computer system. Permanent storage device 2140 may be a read-and-write memory device. This device may be a non-volatile memory unit that stores instructions and data even when computer system 2100 is off or unpowered. Permanent storage device 2140 may include a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive).

Computer system 2100 may use a removable storage device and/or a remote storage device as the permanent storage device. System memory 2115 may be a volatile read-and-write memory, such as a random access memory (“RAM”). The system memory may store some of the instructions and data that the processor uses at runtime. The sets of instructions and/or data used to implement some embodiments may be stored in the system memory 2115, the permanent storage device 2140, and/or the read-only memory 2120. Other components 2125 may perform various other functions. These functions may include, for instance, image rendering, image filtering, etc.

Input devices 2130 may enable a user to communicate information to the computer system and/or manipulate various operations of the system. The input devices may include keyboards, cursor control devices, audio input devices and/or video input devices. Output devices 2135 may include printers, displays, and/or audio devices. Some or all of the input and/or output devices may be wirelessly or optically connected to the computer system.

Finally, as shown in FIG. 21, computer system 2100 may be coupled to a network 2150 through a network interface 2145. For example, computer system 2100 may be coupled to a web server on the Internet such that a web browser executing on computer system 2100 may interact with the web server as a user interacts with an interface that operates in the web browser.

As used in this specification and any claims of this application, the terms “computer”, “server”, “processor”, and “memory” all refer to electronic devices. These terms exclude people or groups of people. As used in this specification and any claims of this application, the term “non-transitory storage medium” is entirely restricted to tangible, physical objects that store information in a form that is readable by electronic devices. These terms exclude any wireless or other ephemeral signals.

It should be recognized by one of ordinary skill in the art that any or all of the components of computer system 2100 may be used in conjunction with the invention. Moreover, one of ordinary skill in the art will appreciate that many other system configurations may also be used in conjunction with the invention or components of the invention.

Moreover, while the examples shown may illustrate many individual modules as separate elements, one of ordinary skill in the art would recognize that these modules may be combined into a single functional block or element. One of ordinary skill in the art would also recognize that a single module may be divided into multiple modules.

While the invention has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be embodied in other specific forms without departing from the spirit of the invention. For example, several embodiments were described above by reference to particular features and/or components. However, one of ordinary skill in the art will realize that other embodiments might be implemented with other types of features and components. One of ordinary skill in the art would understand that the invention is not to be limited by the foregoing illustrative details, but rather is to be defined by the appended claims.

I claim:
1. A robotic device that implements a learning rule in a three-dimensional (3D) environment, the robotic device comprising: a camera that captures and renders a plurality of two-dimensional (2D) images associated with a 3D environment; a processor for executing a set of instructions; and a non-transitory medium that stores the set of instructions, wherein the set of instructions comprises: generating a set of appearance features based at least partly on a 2D image from among the plurality of 2D images; generating a set of learned features based at least partly on each set of appearance features; and generating a set of estimated environment features based at least partly on the set of learned features.
2. The robotic device of claim 1, the set of instructions further comprising: evaluating the set of estimated scene features and the learning rule; and updating a first set of parameters used to generate the set of learned features.
3. The robotic device of claim 2, the set of instructions further comprising: evaluating the set of estimated scene features and the learning rule; and updating a second set of parameters used to generate the set of estimated scene features.
4. The robotic device of claim 3, wherein: each learned feature in the set of learned features is calculated based at least partly on a non-linear function applied to a sum of cross products of a vector of appearance features and a 2D update matrix based at least partly on the first set of parameters; and each estimated scene feature in the set of estimated scene features is calculated based at least partly on the non-linear function applied to a sum of cross products of a vector of learned features and a 2D update matrix based at least partly on the second set of parameters.
5. The robotic device of claim 3, wherein the 3D environment is a virtual environment.
6. The robotic device of claim 5, wherein the 3D environment comprises a set of true labels of scene features, and each of the first and second sets of parameters is based at least partly on the set of true labels.
7. The robotic device of claim 6, wherein the set of true labels comprises a spatial map of Boolean values.
8. An automated method that appends dense labels to two-dimensional (2D) images, the method comprising: recording a video comprising a sequence of 2D images related to a three-dimensional (3D) scene; capturing a 2D image related to the 3D scene; evaluating the video to identify information that is absent from the captured 2D image; and encoding the identified information into a file comprising the 2D image.
9. The automated method of claim 8 further comprising using the encoded information to predict a path of an object within the 3D scene.
10. The automated method of claim 8 further comprising using the encoded information to determine relative positions of a set of objects within the 3D scene.
11. The automated method of claim 8, wherein the identified information is encoded using dense labels.
12. The automated method of claim 8, wherein the 3D scene is associated with a virtual environment.
13. The automated method of claim 8, wherein the 3D scene is associated with a physical environment and the recording, capturing, evaluating, and encoding are performed by a robotic device having at least one 2D camera and at least one processor.
14. An automated method that predicts image information for at least one image in a sequence of images, the method comprising: determining a feature response for a first image in the sequence of images; determining a feature response for a second image in the sequence of images; identifying a transform based on the feature responses; and applying the transform to the second image to at least partly predict a third image in the sequence of images.
15. The automated method of claim 14 further comprising: identifying mid-level features invariant to the transform; and determining a space of encoded activity based on the transform.
16. The automated method of claim 15 further comprising representing a joint probability of the transform and the space.
17. The automated method of claim 14 further comprising: aligning the first image to the second image; calculating a difference between the first image and the second image; and predicting properties of the third image based at least partly on the calculated difference.
18. The automated method of claim 17 further comprising: calculating edge discontinuities in a motion field associated with the aligned images; and predicting properties of the third image based at least partly on the calculated edge discontinuities.
19. The automated method of claim 14, wherein the sequence of images is associated with a three-dimensional scene.
20. The automated method of claim 19, wherein the sequence of images comprises a set of two-dimensional images captured over time.